Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Multithreading for Enhanced Performance in Custom Check Processing #284

Merged
merged 6 commits into from
May 6, 2024

Conversation

rajeshpandey2053
Copy link
Contributor

@rajeshpandey2053 rajeshpandey2053 commented Apr 24, 2024

Description:

This pull request introduces a multithreading solution to enhance the performance of custom check processing in the codebase. The existing codebase comprises various rules applied to multiple fields. During execution, it became evident that parallel processing is necessary at two levels: the field level and the argument level.

At the field level, a check needs to be performed for multiple fields simultaneously. Meanwhile, at the argument level, a check traverses through multiple arguments within a single field. Therefore, nested multithreading is required to fully improve the overall performance of the project.

By leveraging multithreading at both levels, we aim to parallelize the execution of checks, thereby significantly improving performance and efficiency, especially in scenarios involving a large number of checks or resolving URLs.

Changes:

  • Introduces new functions process_argument in custom_checker.py and _process_field in checker.py to segregate the processes for parallel execution.
  • Refactors the existing loop in the run function to utilize multithreading for parallel execution of custom checks.
  • Enhances code readability and maintainability by separating concerns and encapsulating logic into reusable functions.
  • Adds appropriate documentation and comments to explain the multithreading implementation and its benefits.
  • Additional error handling for multi-threaded modules

Testing:

Extensive testing has been conducted to ensure the correctness and performance of the multithreading solution.
Integration tests have been performed to validate the code's functionality in various scenarios and edge cases.

Impact:

This change significantly improves the performance of custom check processing, especially in scenarios involving a large number of checks or resolving URLs.
Improved the efficiency of the execution on average from ~100s to ~10s

@rajeshpandey2053 rajeshpandey2053 changed the title Multithreading Implement Multithreading for Enhanced Performance in Custom Check Processing Apr 24, 2024
@rajeshpandey2053 rajeshpandey2053 self-assigned this Apr 24, 2024
Comment on lines 211 to 221
for field_dict in list_of_fields_to_apply:
executor.submit(
self._process_field,
func,
check,
rule_id,
metadata_content,
field_dict,
result_dict,
rule_mapping,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't running this in parallel lead to missing values? Eg:
result_dict = {}

Running this method in parallel will pass result_dict to the number of parallel method calls. when updating parallelly, wouldn't result_dict be missing some elements?

Not sure if pass by value takes care of the issue. proper testing (unit and manual testing) is required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did manual unit and integration tests with a curated list of concept ids, results obtained from pyquarc with and without using multithreading are exactly same.

Comment on lines 182 to 199
with ThreadPoolExecutor() as executor:
future_results = []
for arg in args:
future = executor.submit(
self._process_argument,
arg,
func,
relation,
external_data,
external_relation,
invalid_values,
validity,
)
future_results.append(future)

# Retrieve results from futures
for future in future_results:
invalid_values, validity = future.result()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't sub-threading be an issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did manual testing with a curated list of concept ids, no issue at all.

@xhagrg xhagrg merged commit a0fbe59 into dev May 6, 2024
1 check passed
@xhagrg xhagrg deleted the multithreading branch May 6, 2024 17:35
@slesaad slesaad mentioned this pull request Jun 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants